Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
Abstract
We study the problem of online learning in Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O(√(T log |Π|) + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start node to the finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It has previously been shown that for randomly chosen graphs and adversarial losses, the problem can be solved efficiently. We show that it can also be solved efficiently for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs.
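The √(T log |Π|) term in such bounds comes from treating the comparison set Π as a finite set of experts and running a multiplicative-weights update over it. The sketch below is a minimal illustration of that idea in the plain full-information expert setting, assuming losses in [0, 1] and a small finite policy set; it ignores the MDP dynamics and the mixing assumption handled in the paper, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def hedge(loss_matrix, eta=None):
    """Exponential-weights (Hedge) over a finite set of K candidate policies.

    loss_matrix: shape (T, K) with losses in [0, 1]; row t gives the loss of
    each candidate policy at round t (full information).
    Returns the chosen policy indices and the algorithm's cumulative loss.
    """
    T, K = loss_matrix.shape
    if eta is None:
        eta = np.sqrt(8.0 * np.log(K) / T)  # standard tuning for O(sqrt(T log K)) regret
    log_w = np.zeros(K)                     # log-weights, one per policy
    rng = np.random.default_rng(0)
    choices, total_loss = [], 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                        # sampling distribution over policies
        i = rng.choice(K, p=p)
        choices.append(i)
        total_loss += loss_matrix[t, i]
        log_w -= eta * loss_matrix[t]       # multiplicative update on every policy
    return choices, total_loss

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    losses = rng.random((2000, 8))           # T = 2000 rounds, K = 8 candidate policies
    _, alg_loss = hedge(losses)
    best_fixed = losses.sum(axis=0).min()    # loss of the best fixed policy in hindsight
    print("empirical regret:", alg_loss - best_fixed)
```

Comparing the algorithm's cumulative loss with the best fixed policy in hindsight, as in the demo, gives an empirical regret of order √(T log K); the paper's contribution is obtaining a bound of this form when the losses come from an adversarially controlled MDP rather than an explicit loss table.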
Similar papers
Thompson Sampling for Learning Parameterized Markov Decision Processes
We consider reinforcement learning in parameterized Markov Decision Processes (MDPs), where the parameterization may induce correlation across transition probabilities or rewards. Consequently, observing a particular state transition might yield useful information about other, unobserved, parts of the MDP. We present a version of Thompson sampling for parameterized reinforcement learning proble...
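As a rough illustration of the posterior-sampling idea behind Thompson sampling for MDPs, the sketch below assumes a finite set of candidate transition models (a crude stand-in for a smooth parameterization), known rewards, and fixed-length episodes: in each episode it samples a model from the posterior, acts optimally for it, and reweights the posterior using the observed transitions. The discretized model class and all names are assumptions made for the sketch, not details of the cited paper.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=200):
    """Greedy policy for a known MDP with transitions P[s, a, s'] and rewards R[s, a]."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def thompson_sampling(candidates, R, true_P, episodes=50, horizon=20, seed=0):
    """Posterior sampling over a finite set of candidate transition models.

    candidates: list of transition tensors P_theta[s, a, s'].
    R: known reward table R[s, a].  true_P: the environment's transition tensor.
    """
    rng = np.random.default_rng(seed)
    S = true_P.shape[0]
    log_post = np.zeros(len(candidates))            # uniform prior, kept in log space
    total_reward = 0.0
    for _ in range(episodes):
        p = np.exp(log_post - log_post.max()); p /= p.sum()
        theta = rng.choice(len(candidates), p=p)    # sample a model from the posterior
        policy = value_iteration(candidates[theta], R)
        s = 0
        for _ in range(horizon):
            a = policy[s]
            s_next = rng.choice(S, p=true_P[s, a])
            total_reward += R[s, a]
            # Bayes update: likelihood of the observed transition under each candidate
            log_post += np.log(np.array([c[s, a, s_next] for c in candidates]) + 1e-12)
            s = s_next
    return total_reward
```

Discretizing the parameter space keeps the posterior update a simple likelihood reweighting; a faithful implementation of the cited setting would instead maintain a posterior over continuous parameters that couple transition probabilities or rewards across states.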
Utilizing Generalized Learning Automata for Finding Optimal Policies in MMDPs
Multi-agent Markov decision processes (MMDPs), as the generalization of Markov decision processes to the multi-agent case, have long been used for modeling multi-agent systems and serve as a suitable framework for multi-agent reinforcement learning. In this paper, a generalized learning automata based algorithm for finding optimal policies in MMDPs is proposed. In the proposed algorithm, MMDP ...
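The basic building block of learning-automata algorithms is an automaton that keeps a probability vector over its actions and reinforces whichever action drew a favourable environment response. Below is a minimal single-automaton sketch of the classical linear reward-inaction (L_R-I) update; the multi-agent, MMDP-specific machinery of the cited paper is not reproduced, and the interface (a reward_fn returning a signal in [0, 1]) is an assumption made for the sketch.

```python
import numpy as np

def linear_reward_inaction(n_actions, reward_fn, rounds=5000, lam=0.01, seed=0):
    """A single linear reward-inaction (L_R-I) learning automaton.

    Maintains a probability vector over actions and, in proportion to the
    reward signal, shifts probability mass toward the action just played.
    """
    rng = np.random.default_rng(seed)
    p = np.full(n_actions, 1.0 / n_actions)
    for _ in range(rounds):
        a = rng.choice(n_actions, p=p)
        r = reward_fn(a)                  # assumed to return a signal in [0, 1]
        p = p - lam * r * p               # shrink every action's probability...
        p[a] += lam * r                   # ...and give that mass to the chosen action
    return p

# Example: action 2 is rewarded most, so the automaton concentrates on it.
probs = linear_reward_inaction(4, lambda a: 1.0 if a == 2 else 0.2)
print(probs)
```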
Model-Checking Markov Chains in the Presence of Uncertainties
We investigate the problem of model checking Interval-valued Discrete-time Markov Chains (IDTMCs). IDTMCs are discrete-time finite Markov chains for which the exact transition probabilities are not known; instead, each transition is associated with an interval in which the actual transition probability must lie. We consider two semantic interpretations for the uncertainty in the transi...
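A common way to analyze such interval models is interval value iteration: to bound a reachability probability from above, each state resolves its interval constraints in favour of the highest-valued successors. The sketch below illustrates this idea, assuming the intervals are feasible (lower bounds sum to at most 1, upper bounds to at least 1) and that reachability of a fixed target set is the property of interest; it is an illustrative sketch, not the specific procedure of the cited paper.

```python
import numpy as np

def maximize_within_intervals(low, up, V):
    """Pick a distribution p with low <= p <= up and sum(p) == 1 maximizing p @ V.

    Greedy: start from the lower bounds, then pour the remaining probability
    mass into the successors with the largest values first.
    Assumes feasibility: low.sum() <= 1 <= up.sum().
    """
    p = low.copy()
    slack = 1.0 - p.sum()
    for j in np.argsort(-V):              # highest-value successors first
        add = min(up[j] - p[j], slack)
        p[j] += add
        slack -= add
    return p

def reachability_upper_bound(low, up, target, iters=200):
    """Approximate upper bound on the probability of reaching `target` states
    in an interval-valued DTMC with per-transition bounds low[s, s'], up[s, s']."""
    S = low.shape[0]
    V = target.astype(float)               # 1.0 on target states, 0.0 elsewhere
    for _ in range(iters):
        V_new = V.copy()
        for s in range(S):
            if not target[s]:
                p = maximize_within_intervals(low[s], up[s], V)
                V_new[s] = p @ V
        V = V_new
    return V
```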
Regret Minimization in Nonstationary Markov Decision Processes
We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., nonstationary) fashion to some extent. We propose online learning algorithms and provide guarantees on their performance evaluated in retrospect against stationary policies. Unlike previous works, the guarantees depend critically on the variabilit...
Approximation of Large Probabilistic Networks by Structured Population Protocols
We consider networks of Markov Decision Processes (MDPs) where each MDP is one of the N nodes of a graph G. The transition probabilities of an MDP depend on the states of its direct neighbors in the graph, and runs operate by selecting a random node and following a random transition in the chosen device MDP. As the state space of all the configurations of the network is exponential in N, classi...